Approximate String Matching with Ordered q-Grams
نویسندگان
چکیده
Approximate string matching with k differences is considered. Filtration of the text is a widely adopted technique to reduce the text area processed by dynamic programming. We present sublinear filtration algorithms based on the locations of q-grams in the pattern. Samples of q-grams are drawn from the text at fixed periods, and only if consecutive samples appear in the pattern approximately in the same configuration, the text area is examined with dynamic programming. The algorithm LEQ searches for exact occurrences of the pattern q-grams, whereas the algorithm LAQ searches for approximate occurrences of them. In addition, a static variation of LEQ is presented. The focus of the paper is on combinatorial properties of the sampling approach. ACM CCS
منابع مشابه
Better Filtering with Gapped q-Grams
A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in literature but has never been analyzed in any depth. In this paper, we repo...
متن کاملImproved Approximate Multiple Pattern String Matching using Consecutive Q Grams of Pattern
String matching is to find all the occurrences of a given pattern in a large text both being sequence of characters drawn from finite alphabet set. This problem is fundamental in computer Science and is the basic need of many applications such as text retrieval, symbol manipulation, computational biology, data mining, and network security. Bit parallelism method is used for increasing the proce...
متن کاملExtending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance
There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of r...
متن کاملAn edit operation-based approach to approximate string matching in large DNA databases
In DNA related research, due to various environment conditions, mutations occur very often, where a mutation is defined as a heritable change in the DNA sequence. Therefore, approximate string matching is applied to answer those queries which find mutations. The problem of approximate string matching is that given a user specified parameter, k, we want to find where the substrings, which could ...
متن کاملComputing the Threshold for q-Gram Filters
A popular and much studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has been recently shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold which de...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Nord. J. Comput.
دوره 11 شماره
صفحات -
تاریخ انتشار 2004